INTRODUCTION

IIT Research Institute has conducted several research programs
under contracts and grants from the National Science Foundation
Office of Science Information Services. While the earlier programs
(1968-1972 period) were primarily concerned with developing viable
information retrieval systems, recent work has concentrated on
improvement in the quality and cost-effectiveness of retrieval.
These programs have been aimed toward the discovery of information
useful to a variety of systems and system operators. I shall
discuss two of these projects briefly today. One has been completed,
and detailed information is available in the Final Report. The
other program is current: a semi-annual report is now available,
and a Final Report will be distributed at the end of this year.

STUDY OF INDEXING AND INFORMATION DISPLAY 

The first project was conducted from November 1973 to April 1975.

Program Goals 

Studies of indexing and display have been performed for many
years. They have ranged from studies of the value of content
indexing as opposed to citation indexing to studies of automated
indexing. Many of these failed to be definitive because no user
community had available one large data bank containing all of the
variables to be tested.

Using the facilities of the Computer Search Center (CSC), IIT
Research Institute (IITRI) was able to overcome some of these
limitations. The recently completed program was possible because of:

• the existence of large machine-readable bibliographic
record files representing the same set of documents
in different but related ways*,

• the existence of software capable of performing identical
manipulations on each of the above,

• the existence of well-tested user questions, and

• the existence and availability of people familiar with
the data base(s) and software system.

*For this program, a bibliographic record is comprised of citation
information (author and location, title, source (e.g. journal),
source subelements (e.g. volume, issue, pagination, date)), plus
additional information such as abstracts, index terms, molecular
formulae, etc.

Data bases used in this study were the Chemical Abstracts Service's
CA Condensates, CASIA and CBAC, all converted to IITRI format for
searching purposes.

Program Strategy 

This was a three-experiment program designed to quantify
several aspects of indexing and information display.

I. Effect of Record Completeness on Relevancy Judgement

This experiment tested the effects of completeness of
records on retrieval efficiency. The same set of questions
was used to search several versions of the same
data base. These versions differed in the amount of
material available for searching (titles only, plus
index terms, etc.)

II. Effect of Indexing Methodology on Retrieval Efficiency

This experiment studied several indexing methods for
their effects upon retrieval efficiency.

III. Effect of Information Display on Relevancy Judgement

This experiment tested the effects of completeness of 
display on relevance judgement. The total set of re- 
trieved documents , found by all search methods , was 
judged for relevance. Only certain parts of the rec- 
ords were given to each reviewer, to determine the 
effects of record completeness on their judgement of 
relevancy. 

Program Findings 

Experiment #1 - Effect of Record Completeness on Relevancy Judgement 



Three levels of record completeness (citation information only,
citation information plus keywords, and citation information
plus keywords and abstract) were searched.

This experiment showed that the more complete a record, the more
likely is its selection as a hit. The overall average selection
for citation information was 31.68%; for citation information
plus keywords was 54.50%; and for citation information plus
keywords and abstracts was 76.29%. Adding keywords increased the
likelihood of getting a hit by 23%, while adding the abstracts
increased the likelihood of a hit by another 22%. The addition
of all the other elements (index terms, Registry Numbers,
molecular formula, etc.) added another 24%.
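
The increments quoted above are differences, in percentage points, between successive selection rates. A minimal sketch of that arithmetic follows; the function name `gains` is hypothetical, and the 100% figure for the full record is taken from Figure 1.

```python
# Average selection rates reported for Experiment #1 (percent of hits).
SELECTION = [
    ("citation only", 31.68),
    ("citation + keywords", 54.50),
    ("citation + keywords + abstract", 76.29),
    ("full record", 100.00),  # full record is the 100% level in Figure 1
]

def gains(levels):
    """Return (label, gain) pairs: percentage points added by each level."""
    return [
        (label, round(rate - prev_rate, 2))
        for (_, prev_rate), (label, rate) in zip(levels, levels[1:])
    ]

for label, gain in gains(SELECTION):
    print(f"{label}: +{gain:.2f} points")
```

Rounded to whole numbers, the three gains are the 23, 22 and 24 quoted in the text.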

While CSC costs are not absolutely congruous to those of others,
the data in Figure 1 indicate the expected increase in cost with
increase in the size and complexity of a data base. Keywords
return quite a bit in performance for little incremental cost. The
searching of full abstracts provides a similar increase in
performance, but at a considerably larger increase in cost. The
dollar figures are normalized to $100.00.

Figure 1

               CITATION       CITATION INFO.   CITATION, KEYWORDS   FULL
               INFORMATION    + KEYWORDS       + ABSTRACT           RECORD

PERFORMANCE
LEVEL          31.7%          54.5%            76.3%                100%

COST           $36.00         $43.00           $74.00               $100.00

OVERALL PERFORMANCE/COST FOR SEARCHES OF FILES AT GIVEN LEVELS OF
RECORD COMPLETENESS (USING CSC COST FIGURES).
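
One way to read Figure 1 is as the extra normalized cost paid for each extra point of performance when moving from one level of completeness to the next. The sketch below is only an illustration of that reading; the function name `marginal_cost_per_point` is hypothetical and the measure itself is not one the study reports.

```python
# Figure 1 values: (level, performance in percent, normalized cost in dollars).
FIGURE_1 = [
    ("citation information", 31.7, 36.00),
    ("citation info. + keywords", 54.5, 43.00),
    ("citation, keywords + abstract", 76.3, 74.00),
    ("full record", 100.0, 100.00),
]

def marginal_cost_per_point(rows):
    """Extra dollars spent per extra performance point, level to level."""
    result = []
    for (_, perf0, cost0), (name, perf1, cost1) in zip(rows, rows[1:]):
        result.append((name, round((cost1 - cost0) / (perf1 - perf0), 2)))
    return result

for name, dollars in marginal_cost_per_point(FIGURE_1):
    print(f"{name}: ${dollars:.2f} per point")
```

On these figures, keywords cost far less per added point than abstracts do, which is the contrast the paragraph above draws.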

Experiment #2 - Effect of Indexing Methodology on Retrieval Efficiency 

This experiment related indexing methods to retrieval efficiency.
The major portion of the work was carried out on two files. The CA
Condensates file was used to represent unstructured, uncontrolled
indexing and the CASIA file (the time-ordered version of the CAS
Integrated Subject File) was used to represent controlled unstructured
indexing.




To obtain a baseline measure for appearance of terms in
given data elements, both the Condensates and the CASIA files
were searched for single terms. The percent found uniquely for
each data element was recorded for each term. Data for this
study are given in Figure 2.

One year of both Condensates and CASIA were searched. It
was shown that the closer a term was to a chemical name, the more
often it was found in CASIA. On an overall average, for this
sample of terms the Condensates keywords were the more discriminatory
field.

One very important fact emerged. CASIA did not replace
citation data and keywords for search purposes. This confirmed
previous findings by both IITRI and the University of Georgia in
their studies of the CAS Integrated Subject File (ISF). While
subject indexing was of benefit when searching chemical names, it
was poorer than citation and keyword information for searching
subject concepts.

Searches were made of the CASIA file for the questions
previously searched against the other versions of the data base. Two
important facts were obvious from the results:

• many more records were extracted from the data base, 
which had not been found via searches of the citation 
and keyword data, and 

• very few of the records identified via citation and
keyword searches were the same as those identified via the
index search.

While a total of 7128 records had been identified by searches of
all versions of the citation and keyword information, the searches
of the index file (CASIA) identified 4988. But only 828 of these
4988 were common with those contained in the 7128. Thus, there
were actually 11,288 identified records from the sum of the searches.
In point of fact, searches of citation and keyword data alone
perform considerably better (63.15%) than those of index data alone
(44.19%). To insure complete retrieval, both are required.
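
The totals above follow from counting the overlap once: the combined set is the sum of the two result sets minus the records they share. A small sketch of that check, with a hypothetical function name `combine`:

```python
def combine(citation_keyword_hits, index_hits, common):
    """Size of the union of two result sets and each set's share of it."""
    total = citation_keyword_hits + index_hits - common
    return {
        "total": total,
        "citation_keyword_share": round(100 * citation_keyword_hits / total, 2),
        "index_share": round(100 * index_hits / total, 2),
    }

result = combine(7128, 4988, 828)
print(result)
```

The two shares sum to more than 100% because the 828 common records are counted in both.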

Figure 2

[Column headings other than "TITLE" and "ABSTRACT" are not legible
in the source; the four legible data columns are numbered here.]

TERM                        (1)      (2)      (3)      (4)

*Prostaglandin*           32.54    32.92    15.84    18.58
*Penicillin*              11.03    29.31    33.45    26.03
Norepinephrin*             9.32    20.80    30.33    39.45
*Dopa*                    16.33    22.55    32.50    28.54
Teratogen*                 5.32    57.79    36.50     0.00
L-Dopa*                   22.34    38.83    38.46     0.00
[?]roxyphenylalanine      36.36     6.06    18.18    36.36
Neurotransmitt*           23.21    42.86    32.14     0.00
Biogenic amine*           35.96    38.20    24.72     0.00
*Nitroso*                 11.6?    16.08    23.08    49.09

MEAN:                     20.40    30.54    28.52    19.81

* Indicating truncation

[The remaining columns of this figure are not recoverable from the
source.]

Experiment #3 - Effect of Information Display on Relevancy Judgement



The output from all the search types conducted in Experiment #1
was summed. Each was printed in three formats:

• titles and citations only,

• titles, citations and keywords, and

• titles, citations, keywords, index terms and abstracts.

Printouts were distributed to IITRI scientists in such a way that
each display mode for each profile was evaluated by a different
scientist.

The results were consistent. Two profiles had no hits and were
discounted. Of the remaining 21 profiles, 15 showed one pattern
and six showed another. The most common pattern, obtained in 15
of the 21 profiles, was that the Title Only display mode gave
the poorest Recall* and highest Noise* while the Title Plus Keyword
display mode gave better Recall and less Noise. The All Fields
display mode, by definition, had total Recall and no Noise. In
general, however, the Title Only and Title Plus Keyword display
modes were fairly similar and poor in relationship to the All Fields
display mode. This strongly indicates the need for full index terms
and abstracts in display of records to assure good relevance judgement.

The other six profiles, while also showing relatively poor
performance by both the Title Only and the Title Plus Keyword display
modes, showed a seeming anomaly in that the Title Only display mode
resulted in better relevancy judgements than the Title Plus Keyword
display mode. Analysis of the profiles provided the answer. They
were similar in that Title Only left a number of ambiguous cases, so
those were selected. The keywords, here, worked only in a negative
sense. They removed some ambiguous cases, but didn't add more
specific records.



*Recall is a measure of the degree of potential performance (relevant
records selected).

*Noise is a measure of confusion (irrelevant records erroneously
selected).
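
The two footnoted measures can be computed by comparing what a reviewer selected under a reduced display with the judgement made from the full record. The sketch below assumes, since the source gives no formulas, that Recall is the fraction of relevant records selected and Noise is the fraction of selected records that are irrelevant; the function name and record numbers are hypothetical.

```python
def recall_and_noise(selected, relevant):
    """Compare one display mode's selections with the full-record judgement.

    selected: record ids chosen under a reduced display mode.
    relevant: record ids judged relevant with all fields displayed.
    Both ratio definitions are assumptions, not taken from the report.
    """
    selected, relevant = set(selected), set(relevant)
    recall = len(selected & relevant) / len(relevant) if relevant else 0.0
    noise = len(selected - relevant) / len(selected) if selected else 0.0
    return recall, noise

# Hypothetical profile: five relevant records, four selected from titles only.
print(recall_and_noise({1, 2, 3, 4}, {1, 2, 5, 6, 7}))
```

Under these definitions the All Fields mode compared with itself gives Recall 1.0 and Noise 0.0, matching the "by definition" remark above.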
Experiment #3 strongly indicates the need for full abstracts
as well as citation information, titles, keywords and index terms
as display items. While keywords improve titles somewhat, a
large percentage of relevant records will be missed if abstracts
and index terms are not present in the display.

Major Program Implication 

• For efficient retrieval evaluation, a display including
full citation, keywords and abstract is necessary.

• However, good search results can be obtained from a
system with titles and keywords available for searching.
The addition of abstracts to the searching field, while
increasing the search capability somewhat, greatly increases
data base preparation, update and manipulation costs.

• A cost-effective, efficient search system should have a
capability to search titles and keywords, combined with
a display capability including full citations, keywords
and abstracts.

• Introduction of the CASIA file for on-line searching
would offer a valuable tool to the chemical research
community. It may be possible to extend this statement
to indicate that index information in general (for
any data base) will enhance the utility thereof, but the
data were only obtained for a chemical data base.

ENHANCING THE RETRIEVAL EFFECTIVENESS OF LARGE INFORMATION SYSTEMS 

This project was begun in June of 1975 and is scheduled for
completion by November of this year.

Program Goal

The research goal is to improve computer search quality/efficiency
for bibliographic files. The original proposal emphasized
a two-step process:

1) A standard Boolean (or other) search of high recall resulting
in a large initial retrieved set (RI).

2) A cluster analysis of RI to sort records into categories
so as to reduce user evaluation time without sacrificing
quality (precision and recall).




Prior to this grant, IITRI had written several clustering
programs incorporating unique criteria for term associations.
Initial goals of the grant were to test these programs in a
statistically meaningful manner using Chemical Abstracts and Engineering
Index. Initial results indicated that the disparity between
machine and human relevancy judgements contains two large factors
that are amenable to machine solution at a level less complex
than syntax analysis. The two factors are:

• Term synonyms (several terms with similar meanings)

• Term ambiguity (one term with different meanings depending
on context)

These two factors limit search quality for Boolean and for
clustering methods. For the former, the user must try to specify
all synonyms in the original profile, which is a task of diminishing
returns since some of the synonyms will occur only at very low
frequency. For clustering, the algorithm attempts to identify
synonyms based on the occurrence patterns of words. While it does
work, it is also clear that it cannot be perfect because the
occurrence patterns of words do not contain enough information, for the
small retrievals involved, to define synonyms precisely. Thus,
two different means of overcoming these two word definition problems
are being explored. These methods may be incorporated into either
clustered or non-clustered retrievals:

• Have the user evaluate a sorted list of terms derived 
from RI (as a part of a standard search) 

• Construct a term map so that synonyms and ambiguities
may be automatically simplified.

Both of these methods may prove to be compatible with an on- 
line environment. 

It seems probable that the future of information retrieval from
bibliographic files lies in the direction such that machines will
more closely approximate the processes that occur in the mind of the
manual searcher. Historically, the progression has been:

• look for the occurrence of a list of words 

• look for combinations of words from a list 

• group citations according to combinations of words
and present the groups to the user (standard clustering)

• group citations obtained by test against a list of words
according to the combinations and present them to the
user (IITRI Algorithm)

Our current activity adds to this list:

• group citations obtained by a test against a list of words
according to the combinations, taking into account synonyms
and ambiguity.

This last step is in the direction of syntax and meaning because
it involves enabling the computer to work with definitions. It is
related to automatic indexing and may provide a mechanism for
automatic or assisted profile generation. As the cost of computer storage
and operation continues to fall relative to other costs, the number
of operations per search that are economic rises, and it seems but a
matter of time until the computer operates at a level of syntax/meaning.

Current Activities 

Testing of the Clustering Algorithm 

On the basis of some preliminary clustering runs against
Engineering Index and Chemical Abstracts, criteria were established for
evaluation of the Algorithm, that is, what kind of profiles (Boolean
terms and logic) and retrievals should be used to test the sensitivity
of the Algorithm to jargon, relatedness of concepts and retrieval size.
These decisions have been largely completed.

Characteristics of the Initial Retrieved Set - RI 

While IITRI has studied term frequencies for whole data bases,
it has not previously studied the distribution of the vocabulary
within the set RI, the initial retrieval. It was expected that the
distribution of the "found" vocabulary would be very different from
the vocabulary of the whole data base, in that it would have high relative
frequencies for terms related to the search terms. Thus, programs
were run to generate some sample distributions from Chemical Abstracts
and Engineering Index retrievals. The results showed that the relative
frequencies were too low for direct user evaluation and that an
intermediate mapping or grouping is required.
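
The comparison described above, relative term frequency within RI against relative frequency in the whole data base, can be sketched as follows. This is an illustration only: the function names and the toy records are hypothetical, and relative frequency is assumed here to mean the fraction of records containing a term.

```python
from collections import Counter

def relative_frequencies(records):
    """Fraction of records in which each term appears at least once."""
    counts = Counter(term for record in records for term in set(record))
    return {term: count / len(records) for term, count in counts.items()}

def compare(retrieved, data_base):
    """For each found term: (frequency in RI, frequency in whole data base)."""
    ri = relative_frequencies(retrieved)
    db = relative_frequencies(data_base)
    return {term: (ri[term], db.get(term, 0.0)) for term in ri}

# Toy data base of four records; RI holds the one record a search retrieved.
DATA_BASE = [["tree", "ozone"], ["tree"], ["steel"], ["ozone", "steel"]]
RI = [DATA_BASE[0]]
print(compare(RI, DATA_BASE))
```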



Vocabulary Decomposition 

In an effort to design a module to enable the user to make
intermediate vocabulary judgements (i.e. evaluate the found
vocabulary) it is desirable to know whether there are any simple rules that
distinguish the key words from the others. In an effort to
characterize those words, we have manually analyzed a retrieval set and
isolated the minimum set of words on which an accurate relevancy
judgement could be made. We are currently examining that vocabulary
in detail.

Planned Activities 

Preliminary findings indicate that what is required is a term
map constructed manually on the basis of meanings that can map a
specific term such as "gimbals" back to the level of "navigation".
That is, the map would project the found vocabulary up the hierarchy
towards greater generality. At the more general levels, term
frequencies would be expected to be greater, so that the number of user
evaluations that would be required would be reduced.

Another facet of the word map/projection process is that it
would allow the found vocabulary to be sorted to that link to which
it is relevant. That is, suppose in an A & B type search, the A
terms are plants and the B terms are air pollutants. The program may
find "Tree" in the found vocabulary and it could then associate it
via the map with the A link.
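
A term map of the kind described can be pictured as a table of broader-term pointers that is followed upward until the desired level of generality is reached. The sketch below is a guess at one simple form such a map could take; every entry in `BROADER`, including the intermediate "gyroscope" level, and both function names are hypothetical.

```python
# Hypothetical hand-built map: each term points to its broader term.
BROADER = {
    "gimbals": "gyroscope",      # intermediate level is an assumption
    "gyroscope": "navigation",
    "tree": "plants",
    "bush": "plants",
    "ozone": "air pollutants",
}

def project(term, levels=None):
    """Follow broader-term pointers up to `levels` steps, or to the top."""
    seen = {term}
    steps = 0
    while term in BROADER and (levels is None or steps < levels):
        term = BROADER[term]
        if term in seen:
            raise ValueError(f"cycle in term map at {term!r}")
        seen.add(term)
        steps += 1
    return term

def assign_link(term, links):
    """Return the search link (e.g. 'A') whose concept the term projects to."""
    top = project(term)
    for name, concept in links.items():
        if top == concept:
            return name
    return None

LINKS = {"A": "plants", "B": "air pollutants"}
print(project("gimbals"), assign_link("tree", LINKS))
```

Terms that project to neither concept, such as "gimbals" in this search, are left unassigned.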

If the word map and the term link assignment are available, the 
scenario of a search would then be as follows: 

1. User specifies links and logic (example: A & B)

2. Computer finds initial retrieved set (RI)

3. Computer finds found vocabulary of RI (example: Tree)

4. Computer uses word map to reduce found vocabulary to
an appropriate level of generality

5. Computer groups found vocabulary according to links

6. User specifies link to be expanded (example: breakdown
by plant terms and keep all air pollutant terms).

7. Computer prints out a list of the retrieved sets (Rn)
to be obtained for each of the examples of found
vocabulary associated with the A link.

1) Tree 167

2) Bush 202

(etc.) (etc.)

8. User specifies which of the Rn subsets he wishes to
obtain and has those printed.
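
The scenario above can be walked through on a toy data base. The following is a minimal sketch under stated assumptions, not the IITRI programs: the records, the profile, the one-level word map and all function names are hypothetical, and the counts it prints are not those of the example in step 7.

```python
# Toy data: each record is the set of terms it contains (all hypothetical).
RECORDS = {
    1: {"plant", "tree", "ozone"},
    2: {"plant", "bush", "smog", "ozone"},
    3: {"plant", "tree", "bush", "ozone"},
    4: {"steel", "ozone"},
}
PROFILE = {"A": {"plant"}, "B": {"ozone"}}            # step 1: links, logic A & B
LINK_CONCEPTS = {"A": "plant", "B": "air pollutant"}  # concept behind each link
WORD_MAP = {"tree": "plant", "bush": "plant", "smog": "air pollutant"}

def initial_retrieval(records, profile):
    """Step 2: keep records that match at least one term of every link."""
    return {rid: terms for rid, terms in records.items()
            if all(terms & wanted for wanted in profile.values())}

def found_vocabulary(ri, profile):
    """Step 3: terms occurring in RI that were not search terms themselves."""
    searched = set().union(*profile.values())
    return set().union(*ri.values()) - searched

def group_by_link(vocabulary, word_map, link_concepts):
    """Steps 4-5: map each found term to a broader concept, then to its link."""
    groups = {link: set() for link in link_concepts}
    for term in vocabulary:
        for link, concept in link_concepts.items():
            if word_map.get(term) == concept:
                groups[link].add(term)
    return groups

def expand_link(ri, terms):
    """Steps 6-7: for each term of the chosen link, list the subset Rn of RI."""
    return {term: sorted(rid for rid, rec in ri.items() if term in rec)
            for term in sorted(terms)}

ri = initial_retrieval(RECORDS, PROFILE)
groups = group_by_link(found_vocabulary(ri, PROFILE), WORD_MAP, LINK_CONCEPTS)
subsets = expand_link(ri, groups["A"])
for number, (term, ids) in enumerate(subsets.items(), start=1):
    print(f"{number}) {term.capitalize()} {len(ids)}")
```

Step 8 would then amount to printing the records whose ids appear in the subsets the user picks.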

This scenario obviates many of the problems. The synonym
problem is handled explicitly by the word map. Term ambiguity
may be handled by building limited associations into the map.
The key questions now are:

1. Are our preliminary results of general validity?

2. Can the required file access and the computations
be done in times compatible with an on-line environment?

3. How expensive would it be to construct a functional
word map?

We will continue work toward definitive answers to these questions.